Appendix C — Assignment C

Pandas

Instructions

You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.
Do not write your name on the assignment.
Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.
Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.
The assignment is worth 100 points, and is due on 22nd October 2023 at 11:59 pm. There is an optional bonus question worth 10 points. You can score a maximum of 110 (out of 100) points.
You are not allowed to use a for loop or any other kind of loop in this assignment.

C.1 GDP per capita & social indicators

Read the file social_indicator.txt with python. Set the first column as the index when reading the file. How many observations and variables are there in the data?

(4 points)

C.1.1

Which variables have the strongest and weakest correlations with GDP per capita? Note that lifeFemale and lifeMale are the female and male life expectancies respectively.

Note that only when the magnitude of the correlation is considered when judging a correlation as strong or weak.

(4 points)

C.1.2

Does the male economic activity (in the column economicActivityMale) have a positive or negative correlation with GDP per capita? Did you expect the positive/negative correlation? If not, why do you think you are observing that correlation?

(4 points)

C.1.3

What is the rank of the US amongst all countries in terms of GDP per capita? Which countries lie immediately above, and immediately below the US in the ranking in terms of GDP per capita? The country having the highest GDP per capita ranks 1.

Note that:

The US is mentioned as United States in the data.
The country with the highest GDP per capita will have rank 1, the country with the second highest GDP per capita will have rank 2, and so on.

Hint: rank()

(4 points)

C.1.4

Which country or countries rank among the top 20 in terms of each of these social indicators - economicActivityFemale, economicActivityMale, gdpPerCapita, lifeFemale, lifeMale? For each of these social indicators, the country having the largest value ranks 1 for that indicator.

(6 points)

Hint:

Use rank(). Note that this method is different from the method given in the hint of the previous question. This method is of the DataFrame class, while the one in the previous question is of the Series class. This part of the hint is just for your understanding. You don’t need to write any code in this part.
Using rank(), get the DataFrame consisting of the ranks of countries on each of the relevant social indicators (one line of code).
In the DataFrame obtained in (2), filter the rows for which the maximum rank is less than or equal to 20 (one line of code).

C.1.5

On which social indicator among economicActivityFemale, economicActivityMale, gdpPerCapita, lifeFemale, lifeMale, illiteracyFemale, illiteracyMale, infantMortality, and totalfertilityrate does the US have its worst ranking, and what is the rank? Note that for illiteracyFemale, illiteracyMale, and infantMortality, the country having the lowest value will rank 1, in contrast to the other social indicators.

(8 points)

C.1.6

Find all the countries that have a lower GDP per capita than the US, despite having lower illiteracy rates (for both genders), higher economic activity (for both genders), higher life expectancy (for both genders), and lower infant mortality rate than the US?

(6 points)

C.2 GDP per capita vs social indicators

We’ll use the same data as in in the previous question. For the questions below, assume that all numeric columns, except GDP per capita, are social indicators.

C.2.1

Use the column geographic_location to create a new column called continent. Merge the values of the geographic_location column appropriately to obtain 6 distinct values for the continent column – Asia, Africa, North America, South America, Europe and Oceania. Drop the column geographic_location. Print the first 5 observations of the updated DataFrame.

(8 points)

Hint:

Use value_counts() to see the values of geographic_location. The code if 'Asia' in 'something' will return True if ‘something’ contains the string ‘Asia’, for example, if ‘something’ is ‘North Asia’, the code with return True. This part of the hint is just for your understanding. You don’t need to write any code in this part.
Apply a lambda function on the Series geographic_location to replace a string that contains ‘Asia’ with ‘Asia’, replace a string that contains ‘Europe’ with ‘Europe’, and replace a string that contains ‘Africa’ with ‘Africa’. This will be a single line of code.
Rename the column georgaphic_location to continent.

C.2.2

Sort the column labels lexicographically. Drop the columns region and contraception. Print the first 5 observations of the updated DataFrame.

Hint: sort_index()

(4 points)

C.2.3

Find the percentage of the total countries in each continent.

Hint: One line of code with value_counts() and shape

(4 points)

C.2.4

Which country has the highest GDP per capita? Let us call it country \(G\).

(4 points)

C.2.5

We need to find the African country that is the closest to country \(G\) with regard to social indicators. Perform the following steps:

C.2.5.1

Standardize each of the social indicators to a standard normal distribution so that all of them are on the same scale (remember to exclude GDP per capita from social indicators).

Hint:

For scaling a random variable to standard normal, subtract the mean from each value of the variable, and divide by its standard deviation.
Use the apply method with a lambda function to scale all the social indicators to standard normal.
The above (1) and (2) together is a single line of code.

(6 points)

C.2.5.2

Compute the Manhattan distance between country \(G\) and each of the African countries, based on the scaled social indicators.

Hint:

Broadcast a Series to a DataFrame
The Manhattan distance between two points \((x_1, x_2, ..., x_p)\) and \((y_1, y_2, ..., y_p)\) is \(|x_1 - y_1| + |x_2 - y_2| + ... + |x_p-y_p|\), where \(|.|\) stands for absolute value (for example, \(|-2| = 2; |3| = 3\)).

(8 points)

C.2.5.3

Identify the African country, say country \(A\), with the least Manhattan distance to country \(G\).

(8 points)

C.2.6

Find the correlation between the Manhattan distance from country \(G\) and GDP per capita for African countries.

(6 points)

C.2.7

Based on the correlation coefficient in \(2(f)\), do you think African countries should try to emulate the social characteristics of country \(G\)? Justify your answer.

(4 points)

C.3 Medical data

Read the data sets conditions.csv and patients.csv. Suppose we are interested in studying patients with prediabetes condition. Do not drop or compute any missing values. In condition.csv, the patient IDs are stored in column PATIENT, and the medical conditions are stored in column DESCRIPTION. In patient.csv, the patient IDs are stored in column Id.

C.3.1

Print the patient IDs of all the patients with prediabetes condition.

(4 points)

C.3.2

Make a subset of the data with only prediabetes patients. How many prediabetes patients are there?

(4 points)

Hint: .isin()

C.3.3

What proportion of the total HEALTHCARE_EXPENSES of all the patients correspond to the HEALTHCARE_EXPENSES of prediabetes patients.

(4 points)

C.4 Bonus question

This is an optional question with no partial credit. You will get points only if your solution is completely correct. We advise you to attempt it only when you are done with the rest of the assignment.

(10 points)

Read the file STAT303-1 survey for data analysis.csv. In this question, we’ll work to clean this data a bit. As with every question, you are not allowed to use a for loop or any other loop.

Execute the following code to read the data and clean the column names.

Code

survey_data = pd.read_csv('STAT303-1 survey for data analysis.csv')
new_col_names = ['parties_per_month', 'smoke', 'weed', 'introvert_extrovert', 'love_first_sight', 'learning_style', 'left_right_brained', 'personality_type', 'social_media', 'num_insta_followers', 'streaming_platforms', 'expected_marriage_age', 'expected_starting_salary', 'fav_sport', 'minutes_ex_per_week', 'sleep_hours_per_day', 'how_happy', 'farthest_distance_travelled', 'fav_number', 'fav_letter', 'internet_hours_per_day', 'only_child', 'birthdate_odd_even', 'birth_month', 'fav_season', 'living_location_on_campus', 'major', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age', 'height', 'height_father', 'height_mother', 'school_year', 'procrastinator', 'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before', 'dominant_hand', 'childhood_in_US', 'gender', 'region_of_residence', 'political_affliation', 'cant_change_math_ability', 'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
survey_data.columns = list(survey_data.columns[0:2])+new_col_names

Check the datatype of the variables using the dtypes attribute of the Pandas DataFrame object. You will notice that only two variables are numeric. However, if you check the first few observations of the data with the function head() you will find the there are several more variables that seem to have numeric values.

Write a function that accepts a Pandas Series (or a column of a Pandas DataFrame object) as argument, and if the datatype of the Series is non-numeric, does the following:

Checks if at least 10 values of the Series contain a digit in them.
If at least 10 values are found to contain a digit, then:

A. Eliminate the characters ~, +, and , from all the values of the Series.

B. Convert the Series to numeric (with coercion if needed).
If at least 10 values are NOT found to contain a digit, then:

A. If the values of the Series are ‘Yes’ and ‘No’, then replace ‘Yes’ with 1 and ‘No’ with 0. The Series datatype must change to numeric as well.

B. If the values of the Series are ‘Agree’ and ‘Disagree’, then replace ‘Agree’ with 1 and ‘Disagree’ with 0. The Series datatype must change to numeric as well.

Apply the function to each column of survey_data using the apply() method. Save the updated DataFrame as survey_data_clean. Then, execute the following code.

Code

survey_data_clean.describe().loc['mean',:]

The above code should print out the mean values of 28 numeric columns in the data survey_data_clean.

The variables that you should see as numeric in survey_data_clean are given in the list numeric_columns below (this is just to check your work):

Code

numeric_columns = ['parties_per_month', 'love_first_sight', 'num_insta_followers',
       'expected_marriage_age', 'expected_starting_salary',
       'minutes_ex_per_week', 'sleep_hours_per_day',
       'farthest_distance_travelled', 'fav_number', 'internet_hours_per_day',
       'only_child', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age',
       'height', 'height_father', 'height_mother', 'procrastinator',
       'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before',
       'childhood_in_US', 'cant_change_math_ability',
       'can_change_math_ability', 'math_is_genetic',
       'much_effort_is_lack_of_talent']

Note that your function must be general, i.e., it must work for any other dataset as well. This means, you cannot hard code a column name, or anything specific to survey_data in the function.